Improving Classification of Multi-Lingual Web Documents using Domain Ontologies

Authors

Marina Litvak

Mark Last

Slava Kisilevich

Abstract

In this paper, we address the problem of analyzing and classifying web documents into several major categories (classes) in a given domain using a domain ontology. We present an ontology-based web content mining methodology with three main stages: collecting a training set of labeled documents from a given domain, building a classification model for this domain from the domain ontology, and classifying new documents with the induced model. We tested the proposed methodology in a specific domain, namely web pages containing information about the production of certain chemicals. Our goal is to identify all relevant web documents while ignoring documents that contain no relevant information. The system receives as input an OWL file built with the Protege tool, which contains the domain-specific ontology, and a set of web documents classified by a human expert as "relevant" or "non-relevant". We use a language-independent key-phrase extractor with an integrated ontology parser (defined in a given language) to create a database from the input documents, which serves as the training set for the classification algorithm. We evaluate the system's classification accuracy using various levels of the ontology. The current version of our system supports web content mining in English, Arabic, Russian, and Hebrew.
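The three stages named in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not name the classification algorithm, so a small Bernoulli naive Bayes model stands in for it; the `ONTOLOGY` dictionary is a hypothetical stand-in for concepts parsed from the OWL file; and the training documents are invented toy examples, not the authors' data. The `level` parameter mimics the idea of using "various levels of ontology" by restricting features to concepts up to a given depth.

```python
import math
import re
from collections import Counter

# Hypothetical toy ontology: concept -> (depth level, surface terms).
# In the paper the ontology is read from an OWL file; this dict is a stand-in.
ONTOLOGY = {
    "chemical":   (1, ["chemical", "compound"]),
    "production": (1, ["production", "manufacture"]),
    "acid":       (2, ["acid"]),
    "reactor":    (2, ["reactor"]),
}

def concepts_up_to(level):
    """Return, in sorted order, the concepts whose depth does not exceed `level`."""
    return sorted(c for c, (depth, _) in ONTOLOGY.items() if depth <= level)

def featurize(text, level):
    """Binary vector: 1 if any term of the concept occurs as a token in the text."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return [int(any(term in tokens for term in ONTOLOGY[c][1]))
            for c in concepts_up_to(level)]

def train(docs, labels, level):
    """Fit a Bernoulli naive Bayes model with add-one smoothing (assumed classifier)."""
    n_features = len(concepts_up_to(level))
    class_counts = Counter(labels)
    ones = {y: [0] * n_features for y in class_counts}
    for text, y in zip(docs, labels):
        for i, value in enumerate(featurize(text, level)):
            ones[y][i] += value
    probs = {y: [(ones[y][i] + 1) / (class_counts[y] + 2) for i in range(n_features)]
             for y in class_counts}
    priors = {y: class_counts[y] / len(labels) for y in class_counts}
    return {"level": level, "priors": priors, "probs": probs}

def classify(model, text):
    """Return the label with the highest log-posterior score for the text."""
    x = featurize(text, model["level"])

    def score(y):
        total = math.log(model["priors"][y])
        for value, p in zip(x, model["probs"][y]):
            total += math.log(p if value else 1 - p)
        return total

    return max(model["priors"], key=score)

# Invented toy training set, labeled the way a human expert might label pages.
TRAIN_DOCS = [
    "Industrial production of sulfuric acid in a chemical reactor",
    "The plant began manufacture of a new chemical compound",
    "Football results and weekend weather forecast",
    "Recipe for chocolate cake with vanilla cream",
]
TRAIN_LABELS = ["relevant", "relevant", "non-relevant", "non-relevant"]

model = train(TRAIN_DOCS, TRAIN_LABELS, level=2)
print(classify(model, "Acid production in a reactor"))
print(classify(model, "Weather forecast for the weekend"))
```

Retraining with `level=1` uses only the two top-level concepts, which is one simple way to compare accuracy across ontology levels. The `\w+` pattern is Unicode-aware in Python 3, so the same tokenizer would also split Arabic, Russian, or Hebrew text, although the authors' key-phrase extractor is presumably more elaborate than this token match.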


Similar resources

Classification of Web Documents Using Concept Extraction from Ontologies

In this paper, we deal with the problem of analyzing and classifying web documents in a given domain by information filtering agents. We present the ontology-based web content mining methodology that contains such main stages as creation of ontology for the specified domain, collecting a training set of labeled documents, building a classification model in this domain using the constructed onto...


Using Multiple Related Ontologies in an Fuzzy Information Retrieval Model

With the Semantic Web progress many independently developed distinct domain ontologies have to be shared and reused by a variety of applications. The use of ontologies in information retrieval applications allows the retrieval of semantically related documents to an initial users’ query. This work presents a fuzzy information retrieval model for improving the document retrieval process consider...


Enriching Ontologies with Encyclopedic Background Knowledge for Document Indexing

The rapidly increasing number of scientific documents available publicly on the Internet creates the challenge of efficiently organizing and indexing these documents. Due to the time consuming and tedious nature of manual classification and indexing, there is a need for better methods to automate this process. This thesis proposes an approach which leverages encyclopedic background knowledge fo...


Multilingual Medical Documents Classification Based on MesH Domain Ontology

This article deals with the semantic Web and ontologies. It addresses the issue of the classification of multilingual Web documents, based on domain ontology. The objective is being able, using a model, to classify documents in different languages. We will try to solve this problematic using two different approaches. The two approaches will have two elementary stages: the creation of the model ...


Mapping Persian Words to WordNet Synsets

Lexical ontologies are one of the main resources for developing natural language processing and semantic web applications. Mapping lexical ontologies of different languages is very important for inter-lingual tasks. On the other hand mapping approaches can be implied to build lexical ontologies for a new language based on pre-existing resources of other languages. In this paper we propose a sem...



Publication date: 2005
